238 PART 5 Looking for Relationships with Correlation and Regression
one level for the reference group (let’s choose 3), and then create two binary
indicator variables for the other two levels: one for 1 = graduated high school
and one for 2 = graduated college. Here’s another example of coding multilevel
categorical variables as a set of indicator variables, where each level is assigned its
own binary variable that is coded 1 if the level applies to the row, and 0 if it does
not (see Table 17-1).
Table 17-1 shows theoretical coding for a data set containing the variables StudyID
(for participant ID) and PrimaryDx (for participant primary diagnosis). As shown
in Table 17-1, you take each level and make an indicator variable for it: Hyperten-
sion is HTN, diabetes is Diab, cancer is Cancer, and other is OtherDx. Instead of
including the variable PrimaryDx in the model, you’d include the indicator vari-
ables for all levels of PrimaryDx except the reference level. So, if the reference level
you selected for PrimaryDx was hypertension, you’d include Diab, Cancer, and
OtherDx in the regression, but would not include HTN. To contrast this with the
education example: in the set of variables in Table 17-1, participants can have a 1 for
one or more indicator variables or just be in the reference group. However, with
the education example, they can only be coded at one level, or be in the reference
group.
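The coding described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows mirror Table 17-1, and the helper name code_indicators and the INDICATOR_NAMES mapping are hypothetical, not part of any statistical package.

```python
# Rows mirroring Table 17-1: (StudyID, PrimaryDx).
rows = [
    (1, "Hypertension"),
    (2, "Diabetes"),
    (3, "Cancer"),
    (4, "Other"),
    (5, "Diabetes"),
]

# Indicator variable name for each level of PrimaryDx, as in Table 17-1.
INDICATOR_NAMES = {
    "Hypertension": "HTN",
    "Diabetes": "Diab",
    "Cancer": "Cancer",
    "Other": "OtherDx",
}


def code_indicators(rows, reference_level):
    """Return one dict per participant with a 0/1 indicator for every
    level of PrimaryDx except the reference level (hypothetical helper)."""
    coded = []
    for study_id, dx in rows:
        record = {"StudyID": study_id}
        for level, name in INDICATOR_NAMES.items():
            if level == reference_level:
                continue  # the reference level gets no indicator variable
            record[name] = 1 if dx == level else 0
        coded.append(record)
    return coded


# Hypertension is the reference level, so HTN is left out.
model_rows = code_indicators(rows, reference_level="Hypertension")
for record in model_rows:
    print(record)
```

With hypertension as the reference level, participant 1 comes out with 0 for Diab, Cancer, and OtherDx, which is how the reference group is identified in the regression.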
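Don’t forget to leave the reference-level indicator variable out of the regression,
or your model will break! With every level’s indicator included, the indicators are
redundant with one another, and the software can’t estimate a separate coefficient
for each.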
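The redundancy behind that warning can be checked directly on the coding in Table 17-1. The sketch below assumes the regression includes an intercept (represented as a column of 1s); the variable names are illustrative only.

```python
# Full indicator coding from Table 17-1, one tuple per participant:
# (HTN, Diab, Cancer, OtherDx)
full_coding = [
    (1, 0, 0, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 0, 1),
    (0, 1, 0, 0),
]

# A model with an intercept implicitly contains a column of 1s.
intercept = [1] * len(full_coding)

# In this table the four indicators in each row add up to exactly 1,
# so together they duplicate the intercept column.
row_sums = [sum(row) for row in full_coding]
print(row_sums == intercept)  # True

# Dropping the reference level (HTN, the first column) removes the duplication:
# the remaining indicators no longer sum to 1 in every row.
reduced_sums = [sum(row[1:]) for row in full_coding]
print(reduced_sums)  # [0, 1, 1, 1, 1]
```

When one column can be rebuilt exactly from the others like this, the regression has no unique set of coefficients, which is why the reference-level indicator stays out.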
Creating scatter charts before you jump into multiple regression analysis
One common mistake researchers make is immediately running a regression or
another advanced statistical analysis before thoroughly examining their data. As
TABLE 17-1  Coding a Multilevel Category into a Set of Binary Indicator Variables

StudyID  PrimaryDx     HTN  Diab  Cancer  OtherDx
1        Hypertension  1    0     0       0
2        Diabetes      0    1     0       0
3        Cancer        0    0     1       0
4        Other         0    0     0       1
5        Diabetes      0    1     0       0